# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.htmldf = pd.read_csv('StarWars.csv', encoding='ISO-8859-1')
Elevator pitch
In this project we are examining Star Wars survey results as well as using a machine learning model to predict whether a respondent makes more than 50k a year. We learn that the best predicter of income was education and not someone’s Star Wars preferences.
QUESTION|TASK 1
Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.
Show the code
df = df.rename(columns={'Have you seen any of the 6 films in the Star Wars franchise?': 'seen_any_sw_films','Do you consider yourself to be a fan of the Star Wars film franchise?': 'fan','Which of the following Star Wars films have you seen? Please select all that apply.': 'what_movies','Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'rank', 'Unnamed: 10': 'rank2', 'Unnamed: 11': 'rank3','Unnamed: 12': 'rank4', 'Unnamed: 13': 'rank5', 'Unnamed: 14': 'rank6','Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.': 'character_rating','Which character shot first?': 'shot_first','Are you familiar with the Expanded Universe?': 'eu','Do you consider yourself to be a fan of the Expanded Universe?': 'eu_fan','Do you consider yourself to be a fan of the Star Trek franchise?': 'star_trek_fan','Household Income': 'income','Location (Census Region)': 'location','Unnamed: 4': 'movie_2','Unnamed: 5': 'movie_3','Unnamed: 6': 'movie_4','Unnamed: 7': 'movie_5','Unnamed: 8': 'movie_6'})df= df.drop(columns=['RespondentID', 'fan' ,'character_rating','Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19','Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23','Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27','Unnamed: 28', 'eu', 'eu_fan', 'star_trek_fan','location'])
Show the code
changes = [ ("Have you seen any of the 6 films in the Star Wars franchise?", "seen_any_sw_films"), ("Do you consider yourself to be a fan of the Star Wars film franchise?", "fan"), ("Which of the following Star Wars films have you seen? Please select all that apply.", "what_movies"), ('Unnamed: 4', 'movie_2'), ('Unnamed: 5', 'movie_3'), ('Unnamed: 6', 'movie_4'), ('Unnamed: 7', 'movie_5'), ('Unnamed: 8', 'movie_6'), ("Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.", "rank"), ('Unnamed: 10', 'rank2'), ('Unnamed: 11', 'rank3'), ('Unnamed: 12', 'rank4'), ('Unnamed: 13', 'rank5'), ('Unnamed: 14', 'rank6'), ("Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.", "character_rating"), ("Which character shot first?", "shot_first"), ("Are you familiar with the Expanded Universe?", "eu"), ("Do you consider yourself to be a fan of the Expanded Universe?", "eu_fan"), ("Do you consider yourself to be a fan of the Star Trek franchise?", "star_trek_fan"), ("Household Income", "income"), ("Location (Census Region)", "location")]def print_table(changes, width=60):print(f"{'Original Column Name':<60} | {'New Column Name'}")print("-"*80)for original, new in changes: wrapped = wrap(original, width)print(f"{wrapped[0]:<60} | {new}")for line in wrapped[1:]:print(f"{line:<60} | ")print_table(changes)
Original Column Name | New Column Name
--------------------------------------------------------------------------------
Have you seen any of the 6 films in the Star Wars franchise? | seen_any_sw_films
Do you consider yourself to be a fan of the Star Wars film | fan
franchise? |
Which of the following Star Wars films have you seen? Please | what_movies
select all that apply. |
Unnamed: 4 | movie_2
Unnamed: 5 | movie_3
Unnamed: 6 | movie_4
Unnamed: 7 | movie_5
Unnamed: 8 | movie_6
Please rank the Star Wars films in order of preference with | rank
1 being your favorite film in the franchise and 6 being your |
least favorite film. |
Unnamed: 10 | rank2
Unnamed: 11 | rank3
Unnamed: 12 | rank4
Unnamed: 13 | rank5
Unnamed: 14 | rank6
Please state whether you view the following characters | character_rating
favorably, unfavorably, or are unfamiliar with him/her. |
Which character shot first? | shot_first
Are you familiar with the Expanded Universe? | eu
Do you consider yourself to be a fan of the Expanded | eu_fan
Universe? |
Do you consider yourself to be a fan of the Star Trek | star_trek_fan
franchise? |
Household Income | income
Location (Census Region) | location
QUESTION|TASK 2
Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
a. Filter the dataset to respondents that have seen at least one film
b. Create a new column that converts the age ranges to a single number. Drop the age range categorical column
c. Create a new column that converts the education groupings to a single number. Drop the school categorical column
d. Create a new column that converts the income ranges to a single number. Drop the income range categorical column
e. Create your target (also known as “y” or “label”) column based on the new income range column
f. One-hot encode all remaining categorical columns
We can see that the data is only showing people who have seen at least one film. I converted the age, education, and income to single number groupings. I also created a target y column by making a column that has 1 if the income is more than or equal to 50,000 and is 0 if it is less than 50,000. I also made all the other categories numerical in order to work with a machine learning model.
Show the code
# adf = df[df['seen_any_sw_films'] =='Yes']# bage_map = {'18-29': 25,'30-44': 37,'45-60': 53,'> 60': 61}df['age'] = df['Age'].map(age_map)df = df.drop(columns=['Age'])# cedu_map = {'Less than high school degree': 1,'High school degree': 2,'Some college or Associate degree': 3,'Bachelor degree': 4,'Graduate degree': 5}df['education_num'] = df['Education'].map(edu_map)df = df.drop(columns=['Education'])# dincome_map = {'$0 - $24,999': 0,'$25,000 - $49,999': 25000,'$50,000 - $99,999': 50000,'$100,000 - $149,999': 100000,'$150,000+': 150000}df['income_num'] = df['income'].map(income_map)df = df.drop(columns=['income'])# edf['income_target'] = (df['income_num'] >=50000).astype(int)y = df['income_target']# fmovie_cols = ['what_movies', 'movie_2', 'movie_3', 'movie_4', 'movie_5', 'movie_6']df['movies_seen_list'] = df[movie_cols].apply(lambda row: [m for m in row.dropna()], axis=1)all_movies =sorted(set(m for lst in df['movies_seen_list'] for m in lst))for movie in all_movies: safe_name = movie.replace(" ", "_").replace(":", "").replace("-", "").replace("'", "") df[f"seen_{safe_name}"] = df['movies_seen_list'].apply(lambda lst: int(movie in lst))df = df.drop(columns=movie_cols + ['movies_seen_list'])rank_cols = ['rank', 'rank2', 'rank3', 'rank4', 'rank5', 'rank6']df[rank_cols] = df[rank_cols].astype(str)df['rank_list'] = df[rank_cols].apply(lambda row: [m for m in row if m notin ["", "nan", "None"]], axis=1)all_movies =sorted(set(m for lst in df['rank_list'] for m in lst))for movie in all_movies: safe_name = movie.replace(" ", "_").replace(":", "").replace("-", "").replace("'", "") df[f"ranked_{safe_name}"] = df['rank_list'].apply(lambda lst: int(movie in lst)) df[f"rank_value_{safe_name}"] = df[rank_cols].apply(lambda row: row.tolist().index(movie) +1if movie in row.tolist() elseNone, axis=1 )df = df.drop(columns=rank_cols + ['rank_list'])binary_cols = ['seen_any_sw_films', 'Gender']df[binary_cols] = df[binary_cols].astype(str)ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)encoded = ohe.fit_transform(df[binary_cols])clean_names = [ name.replace(" ", "_") .replace(":", "") .replace("-", "") .replace("'", "")for name in ohe.get_feature_names_out(binary_cols)]encoded_df = pd.DataFrame(encoded, columns=clean_names, index=df.index)df = pd.concat([df, encoded_df], axis=1)df.head(10)
seen_any_sw_films
shot_first
Gender
age
education_num
income_num
income_target
seen_Star_Wars_Episode_I__The_Phantom_Menace
seen_Star_Wars_Episode_II__Attack_of_the_Clones
seen_Star_Wars_Episode_III__Revenge_of_the_Sith
...
ranked_4
rank_value_4
ranked_5
rank_value_5
ranked_6
rank_value_6
seen_any_sw_films_Yes
Gender_Female
Gender_Male
Gender_nan
1
Yes
I don't understand this question
Male
25.0
2.0
NaN
0
1
1
1
...
1
4.0
1
5.0
1
6.0
1.0
0.0
1.0
0.0
3
Yes
I don't understand this question
Male
25.0
2.0
0.0
0
1
1
1
...
1
4.0
1
5.0
1
6.0
1.0
0.0
1.0
0.0
4
Yes
I don't understand this question
Male
25.0
3.0
100000.0
1
1
1
1
...
1
5.0
1
1.0
1
2.0
1.0
0.0
1.0
0.0
5
Yes
Greedo
Male
25.0
3.0
100000.0
1
1
1
1
...
1
2.0
1
1.0
1
3.0
1.0
0.0
1.0
0.0
6
Yes
Han
Male
25.0
4.0
25000.0
0
1
1
1
...
1
2.0
1
5.0
1
4.0
1.0
0.0
1.0
0.0
7
Yes
Han
Male
25.0
2.0
NaN
0
1
1
1
...
1
3.0
1
2.0
1
1.0
1.0
0.0
1.0
0.0
8
Yes
Han
Male
25.0
2.0
NaN
0
1
1
1
...
1
1.0
1
2.0
1
3.0
1.0
0.0
1.0
0.0
9
Yes
Han
Male
25.0
3.0
0.0
0
1
1
1
...
1
2.0
1
1.0
1
3.0
1.0
0.0
1.0
0.0
10
Yes
I don't understand this question
Male
25.0
3.0
25000.0
0
0
1
0
...
1
4.0
1
5.0
1
6.0
1.0
0.0
1.0
0.0
11
Yes
NaN
nan
NaN
NaN
NaN
0
0
0
0
...
0
NaN
0
NaN
0
NaN
1.0
0.0
0.0
1.0
10 rows × 29 columns
QUESTION|TASK 3
Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.
Here are graphs that show the percentages of which Star Wars movies the respondents have seen and which movies they ranked as their top movie. The percentages match the article so we will move forward and create our machine learning model.
Show the code
e1_percent =round((df['seen_Star_Wars_Episode_I__The_Phantom_Menace'].sum() /835) *100)e2_percent =round((df['seen_Star_Wars_Episode_II__Attack_of_the_Clones'].sum() /835) *100)e3_percent =round((df['seen_Star_Wars_Episode_III__Revenge_of_the_Sith'].sum() /835) *100)e4_percent =round((df['seen_Star_Wars_Episode_IV__A_New_Hope'].sum() /835) *100)e5_percent =round((df['seen_Star_Wars_Episode_V_The_Empire_Strikes_Back'].sum() /835) *100)e6_percent =round((df['seen_Star_Wars_Episode_VI_Return_of_the_Jedi'].sum() /835) *100)summary = pd.DataFrame({"Movie": ["Episode I – The Phantom Menace","Episode II – Attack of the Clones","Episode III – Revenge of the Sith","Episode IV – A New Hope","Episode V – The Empire Strikes Back","Episode VI – Return of the Jedi" ],"Percent_Seen": [ e1_percent, e2_percent, e3_percent, e4_percent, e5_percent, e6_percent ]})summary["Movie"] = pd.Categorical( summary["Movie"], categories=["Episode VI – Return of the Jedi","Episode V – The Empire Strikes Back","Episode IV – A New Hope","Episode III – Revenge of the Sith","Episode II – Attack of the Clones","Episode I – The Phantom Menace" ], ordered=True)summary["Percent_Label"] = summary["Percent_Seen"].round().astype(int).astype(str)( ggplot(summary, aes(x='Percent_Seen', y='Movie')) + geom_bar(stat='identity', width =0.8) + geom_text( aes(label='Percent_Label'), position=position_nudge(x=2.5), size=6 ) + labs(title ="Which 'Star Wars' Movies Have You Seen?", subtitle="Of 835 respondents who have seen any film") + xlab("") + ylab("") + ggsize(600, 200)+ theme(axis_text_y=element_text(size=11),axis_text_x=element_blank(), plot_margin=[5, 20, 5, -50], panel_grid_major='blank', panel_grid_minor='blank', axis_ticks_x='blank', plot_title=element_text(size=20, face="bold",hjust=-.22), plot_subtitle=element_text(hjust=-.22)))
Show the code
movie_cols = ['seen_Star_Wars_Episode_I__The_Phantom_Menace','seen_Star_Wars_Episode_II__Attack_of_the_Clones','seen_Star_Wars_Episode_III__Revenge_of_the_Sith','seen_Star_Wars_Episode_IV__A_New_Hope','seen_Star_Wars_Episode_V_The_Empire_Strikes_Back','seen_Star_Wars_Episode_VI_Return_of_the_Jedi']df_all_seen = df[df[movie_cols].sum(axis=1) ==6]rank_cols = ['rank_value_1', 'rank_value_2', 'rank_value_3','rank_value_4', 'rank_value_5', 'rank_value_6']df_all_seen[rank_cols] = df_all_seen[rank_cols].astype(str)all_ranked_movies =sorted(set(m for row in df_all_seen[rank_cols].values for m in row if m notin ["", "nan", "None"]))movie_first_place_counts = {}for movie in all_ranked_movies: count_first = (df_all_seen['rank_value_1'] == movie).sum() pct_first =round((count_first /len(df_all_seen)) *100, 2) movie_first_place_counts[movie] = pct_firstsummary_rank = pd.DataFrame({"Movie": list(movie_first_place_counts.keys()),"Percent_First": list(movie_first_place_counts.values())})rank_to_movie = {"1.0": "Episode I – The Phantom Menace","2.0": "Episode II – Attack of the Clones","3.0": "Episode III – Revenge of the Sith","4.0": "Episode IV – A New Hope","5.0": "Episode V – The Empire Strikes Back","6.0": "Episode VI – Return of the Jedi"}summary_rank["Movie"] = summary_rank["Movie"].map(rank_to_movie)summary_rank = summary_rank.sort_values("Percent_First", ascending=False)movie_order = ["Episode VI – Return of the Jedi","Episode V – The Empire Strikes Back","Episode IV – A New Hope","Episode III – Revenge of the Sith","Episode II – Attack of the Clones","Episode I – The Phantom Menace"]summary_rank["Percent_Label"] = summary_rank["Percent_First"].round().astype(int).astype(str)( ggplot(summary_rank, aes(x='Percent_First', y='Movie')) + geom_bar(stat='identity', width =0.8) + geom_text( aes(label='Percent_Label'), position=position_nudge(x=1), size=6 ) + labs(title ="What's the Best 'Star Wars' Movie?", subtitle="Of 471 respondents who have seen all six films") + xlab("") + ylab("") + scale_y_discrete(limits=movie_order) + ggsize(600, 200)+ theme(axis_text_y=element_text(size=11),axis_text_x=element_blank(), plot_margin=[5, 20, 5, -50], panel_grid_major='blank', panel_grid_minor='blank', axis_ticks_x='blank', plot_title=element_text(size=20, face="bold",hjust=-.15), plot_subtitle=element_text(hjust=-.15)))
QUESTION|TASK 4
Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.
My final model was a decision tree classifier with the columns ‘seen_any_sw_films’, ‘shot_first’, ‘Gender’, ‘age’, ‘income_num’, ’seen_Star_Wars_Episode_I__The_Phantom_Menace’, ’seen_Star_Wars_Episode_II__Attack_of_the_Clones’, ’seen_Star_Wars_Episode_III__Revenge_of_the_Sith’, ’seen_Star_Wars_Episode_IV__A_New_Hope’, ‘seen_Star_Wars_Episode_V_The_Empire_Strikes_Back’, ‘seen_Star_Wars_Episode_VI_Return_of_the_Jedi’, ‘ranked_1’, ‘rank_value_1’, ‘ranked_2’, ‘rank_value_2’, ‘ranked_3’, ‘rank_value_3’, ‘ranked_4’, ‘rank_value_4’, ‘ranked_5’, ‘rank_value_5’, ‘ranked_6’, ‘rank_value_6’, ‘seen_any_sw_films_Yes’, ‘Gender_Female’, ‘Gender_Male’, ‘Gender_nan’ removed. I removed most of these because their correlation with income was so low that they brought down the accuracy score when kept in. I reached a 65% accuracy score.
Build a machine learning model that predicts whether a person makes more than $50k. With accuracy of at least 65%. Describe your model and report the accuracy.
I reached a 65% accuracy above. It looks like education was the only factor that influenced income. Star Wars opinions did not help us discover someone’s salary.
STRETCH QUESTION|TASK 2
Validate the data provided on GitHub lines up with the article by recreating a 3rd visual from the article.
Create a new column that converts the location groupings to a single number. Drop the location categorical column.
This is the first 10 rows of the new numerical location column.
Show the code
# Include and execute your code heredf = pd.read_csv('StarWars.csv', encoding='ISO-8859-1')loc_map = {'East North Central':1,'East South Central':2,'Middle Atlantic':3,'Mountain':4,'New England':5,'Pacific':6,'Response':7,'South Atlantic':8,'West North Central':9,'West South Central':10}df['location_num'] = df['Location (Census Region)'].map(loc_map)df = df.drop(columns=['Location (Census Region)'])df[['location_num']].head(10)